and Pompeu Fabra University

We consider the hypothesis testing problem of deciding whether 
an observed high-dimensional vector has independent normal compo- 
nents or, alternatively, if it has a small subset of correlated compo- 
nents. The correlated components may have a certain combinatorial 
structure known to the statistician. We establish upper and lower 
bounds for the worst-case (minimax) risk in terms of the size of the 
correlated subset, the level of correlation, and the structure of the 
class of possibly correlated sets. We show that some simple tests 
have near-optimal performance in many cases, while the generalized 
likelihood ratio test is suboptimal in some important cases. 

1. Introduction. In this paper we consider the following statistical prob- 
lem: upon observing a high-dimensional vector, one is interested in detecting 
the presence of a sparse, possibly structured, correlated subset of compo- 
nents of the vector. Such problems emerge naturally in numerous scenarios. 
The setting is closely related to Gaussian signal detection in Gaussian white 
noise, on which there is an extensive literature surveyed in [20]. In image 
processing, textures are modeled via Markov random fields [13] , so that de- 
tecting a textured object hidden in Gaussian white noise amounts to finding 
an area in the image where the pixel values are correlated. Similar situa- 
tions arise in remote sensing based on a variety of hardware. A related task 
is the detection of space-time correlations in multivariate time series, with 
potential applications to finance [1]. 

1.1. Setting and notation. We investigate the possibilities and limitations
in problems of detecting correlations in a Gaussian framework. We may
formulate this as a general hypothesis testing problem as follows. An
$n$-dimensional Gaussian vector $X = (X_1, \ldots, X_n)$ is observed. Under the
null hypothesis $H_0$, the vector $X$ is standard normal, that is, with zero
mean vector and identity covariance matrix. To describe the alternative
hypothesis $H_1$, let $\mathcal{C}$ be a class of subsets of $\{1, \ldots, n\}$,
each of size $k$, indexing the possible "contaminated" components. One wishes
to test whether there exists an $S \in \mathcal{C}$ such that

\[
\mathrm{Cov}(X_i, X_j) =
\begin{cases}
p, & i \ne j, \text{ with } i, j \in S, \\
0, & \text{otherwise},
\end{cases}
\]

where $p > 0$ is a given parameter. Equivalently, if $X = (X_1, \ldots, X_n)$
denotes the vector of observations, then

\[
H_0 : X \sim \mathcal{N}(0, I) \quad \text{vs.} \quad
H_1 : X \sim \mathcal{N}(0, A_S) \text{ for some } S \in \mathcal{C},
\]

where $I$ denotes the $n \times n$ identity matrix and

\[
(A_S)_{i,j} =
\begin{cases}
1, & i = j, \\
p, & i \ne j, \text{ with } i, j \in S, \\
0, & \text{otherwise}.
\end{cases}
\tag{1.1}
\]

We write $\mathbb{P}_0$ for the probability under $H_0$ (i.e., the standard
normal measure in $\mathbb{R}^n$) and, for each $S \in \mathcal{C}$,
$\mathbb{P}_S$ for the measure of $\mathcal{N}(0, A_S)$.
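
As a concrete illustration of (1.1), the following minimal sketch builds the covariance matrix $A_S$ for a given contaminated set. The function name `covariance_matrix` and the use of 0-based indices are assumptions made for this example only; they do not come from the paper.

```python
def covariance_matrix(n, S, p):
    """Return A_S from (1.1) as a list of rows.

    Assumption for this sketch: indices are 0-based, so S is a subset of
    {0, ..., n-1} rather than {1, ..., n}.
    """
    S = set(S)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                A[i][j] = 1.0          # unit variances on the diagonal
            elif i in S and j in S:
                A[i][j] = p            # common correlation inside S
    return A


if __name__ == "__main__":
    for row in covariance_matrix(4, {0, 2}, 0.3):
        print(row)
```

Taking `S` empty (or `p = 0`) gives the identity matrix, which corresponds to the null hypothesis.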

The goal of this paper is to understand for what values of the parameters 
$(n, k, p)$ reliable testing is possible. This, of course, depends crucially on
the size and structure of the subset class C. We consider the following two 
prototypical classes: 

• k-intervals. In this example, we consider the class of all intervals of size
$k$ of the form $\{i, \ldots, i + k - 1\}$ modulo $n$ (the wrap-around is for
aesthetic reasons). (We call such an interval a k-interval.) This class is the
flagship of parametric classes, typical of the class of objects of interest in
signal processing.

• k-sets. In this example, we consider the class of all sets of size $k$, that
is, of the form $\{i_1, \ldots, i_k\}$ where the indices are all distinct in
$\{1, \ldots, n\}$. (We call such a set a k-set.) This class is the flagship of
nonparametric classes, and may arise in multiple comparison situations.

Our theory, however, applies more generally to other classes, such as: 

• k-hypercubes. In this example, the variables are indexed by the
$d$-dimensional lattice, that is, $X = (X_i : i \in \{1, \ldots, m\}^d)$, so
that the sample size is $n = m^d$, and we consider the class of all
hyper-rectangles of the form
$\{i_1, \ldots, i_1 + k_1 - 1\} \times \cdots \times \{i_d, \ldots, i_d + k_d - 1\}$
(each interval modulo $m$) of fixed size $\prod_{s=1}^{d} k_s = k$. This class
is the simplest model for objects to be detected in images (mostly $d = 2, 3$
in applications).

• Perfect matchings. Suppose $n$ is a perfect square with $k^2 = n$. The
components of the observed vector $X$ correspond to edges of the complete
bipartite graph on $2k$ vertices and each set in $\mathcal{C}$ corresponds to
the edges of a perfect matching. Thus, $|\mathcal{C}| = k!$. In this example
$\mathcal{C}$ has a nontrivial combinatorial structure.

• Spanning trees. In another example, $n = \binom{k+1}{2}$ and the components
of $X$ correspond to the edges of a complete graph $K_{k+1}$ on $k + 1$
vertices and every element of $\mathcal{C}$ is a spanning tree of $K_{k+1}$.
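
The two prototypical classes above are easy to enumerate for small parameters. The sketch below is only illustrative: the helper names `k_intervals` and `k_sets` are hypothetical, indices are 0-based, and the k-set enumeration is practical only for small $n$ since the class has $\binom{n}{k}$ elements.

```python
from itertools import combinations


def k_intervals(n, k):
    """All k-intervals {i, ..., i+k-1} modulo n (0-based indices).

    Assumes k < n, so that the n starting points give n distinct sets.
    """
    return [frozenset((i + j) % n for j in range(k)) for i in range(n)]


def k_sets(n, k):
    """All subsets of {0, ..., n-1} of size k."""
    return [frozenset(c) for c in combinations(range(n), k)]


if __name__ == "__main__":
    print(len(k_intervals(10, 3)))  # one interval per starting point
    print(len(k_sets(5, 2)))        # binomial(5, 2) subsets
```

The contrast in cardinality ($n$ versus $\binom{n}{k}$) is what separates the parametric and nonparametric regimes discussed later.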

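As usual, a test is a binary-valued function $f : \mathbb{R}^n \to \{0, 1\}$.
If $f(X) = 0$, then the test accepts the null hypothesis $H_0$; otherwise $H_0$
is rejected by $f$. We measure the performance of a test based on its
worst-case risk over the class of interest $\mathcal{C}$, formally defined by

\[
R^{\max}(f) = \mathbb{P}_0\{f(X) = 1\} + \max_{S \in \mathcal{C}} \mathbb{P}_S\{f(X) = 0\}.
\]

We will derive upper and lower bounds on the minimax risk

\[
R^{\max}_* := \inf_f R^{\max}(f),
\]

where the infimum is taken over all tests $f$.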

A standard way of obtaining lower bounds for the minimax risk is by putting 
a prior on the class C and obtaining a lower bound on the corresponding 
Bayesian risk, which never exceeds the worst-case risk. Because this is true 
for any prior, the idea is to find one that is hardest (often called least favor- 
able). Most classes we consider here are invariant under some group action: 
$k$-intervals are invariant under translation and $k$-sets are invariant under
permutation. Invariance considerations ([21], Section 8.4) lead us to consider
the uniform prior on $\mathcal{C}$, giving rise to the following average risk:

\[
R(f) = \mathbb{P}_0\{f(X) = 1\} + \mathbb{P}_1\{f(X) = 0\},
\]

where

\[
\mathbb{P}_1\{f(X) = 0\} := \frac{1}{N} \sum_{S \in \mathcal{C}} \mathbb{P}_S\{f(X) = 0\},
\]

and N := \C\ is the cardinality of C. The advantage of considering the average 
risk over the worst-case risk is that we know an optimal test for the former, 
which, by the Neyman-Pearson fundamental lemma, is the likelihood ratio 
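test, denoted $f^*$. Introducing

\[
Z_S = \exp\Bigl(\tfrac{1}{2} X^T (I - A_S^{-1}) X\Bigr)
\tag{1.2}
\]

for all $S \in \mathcal{C}$, the likelihood ratio between $H_0$ and $H_1$ may be
written as

\[
L(X) = \frac{1}{N} \sum_{S \in \mathcal{C}} \frac{Z_S}{\mathbb{E}_0 Z_S},
\tag{1.3}
\]

and the optimal test becomes

\[
f^*(x) = 0 \quad \text{if and only if} \quad L(x) \le 1.
\]

Note that $\mathbb{E}_0 Z_S = \sqrt{\det(A_S)}$. The (average) risk
$R^* = R(f^*)$ of the optimal test is called the Bayes risk and it satisfies

\[
R^* = 1 - \frac{1}{2} \mathbb{E}_0 |L(X) - 1|
    = 1 - \frac{1}{2} \mathbb{E}_0 \Bigl| \frac{1}{N} \sum_{S \in \mathcal{C}} \frac{Z_S}{\mathbb{E}_0 Z_S} - 1 \Bigr|.
\]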

Note that, with the only exception of the case of spanning trees, in all 
examples mentioned above, the minimax and Bayes risks coincide, that is, 
$R^* = R^{\max}_*$. This is again due to invariance ([21], Section 8.4). (The
class of spanning trees is not sufficiently symmetric for this equality to
hold. However, as we will see below, even in this case, $R^*$ and $R^{\max}_*$
are of the same order of magnitude.)

We focus on the case when $n$ is large and formulate some of the results in
an asymptotic language with $n \to \infty$, though in all cases explicit
nonasymptotic inequalities are available. Of course, such asymptotic statements
only make sense if we define a sequence of integers $k = k_n$ and classes
$\mathcal{C} = \mathcal{C}_n$. This dependency on $n$ will be left implicit. In
this asymptotic setting, we say that reliable detection is possible (resp.,
impossible) if $R^{\max}_* \to 0$ (resp., $\to 1$) as $n \to \infty$.

Remark (Covariance structure). In this paper we assume that, under 
the alternative hypothesis, the correlation between any two variables in the 
"contaminated" set is the same. While this model has a natural interpre- 
tation (see Lemma 1.1 below), it is clearly a restrictive assumption. This 
simplification is convenient in understanding the fundamental limits of de- 
tection (i.e., in obtaining lower bounds on the risk). At the same time, the 
tests we exhibit also match these lower bounds under more general correla- 
tion structures, such as 

\[
(A_S)_{i,j}
\begin{cases}
= 1, & i = j, \\
\ge p, & i \ne j, \text{ with } i, j \in S, \\
= 0, & \text{otherwise}.
\end{cases}
\tag{1.4}
\]

That said, dealing with more general correlation structures remains an inter- 
esting and important challenge, relevant in the detection of textured objects 
in textured background, for example. 

1.2. Relation to previous work. The vast majority of the literature on 
detection is concerned with the detection of a signal in additive (often 
Gaussian) noise, which would correspond here to an alternative where
$X_i \sim \mathcal{N}(\mu, 1)$ for $i \in S$, where $\mu > 0$ is the
(per-coordinate) signal amplitude. We call this the detection-of-means setting.
The literature on this problem is quite comprehensive. Indeed, the detection of
$k$-intervals and $k$-hypercubes is treated extensively in a number of papers;
see, for example, [4, 6, 10, 14, 22]. A more general framework that includes
the detection of perfect matchings and spanning trees is investigated in [2],
and the detection of $k$-sets is studied in [7, 16-19]. In the literature on
detection of parametric objects, the phrase "correlation detection" usually
refers to the method of matched filters, which consists of correlating the
observed signal with signals of interest. This is not the problem we are
interested in here. While the problem of detection-of-correlations considered
here is mathematically more challenging than the detection-of-means setting,
there is a close relationship between the two. The connection is established by
the representation theorem of [8], stated here for the case of Gaussian random
variables.

Lemma 1.1 ([8]). Let $X_1, \ldots, X_k$ be standard normal with
$\mathrm{Cov}(X_i, X_j) = p$ for $i \ne j$. Then there are i.i.d. standard
normal random variables, denoted $U, U_1, \ldots, U_k$, such that
$X_i = \sqrt{p}\, U + \sqrt{1 - p}\, U_i$ for all $i$.
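
The representation in Lemma 1.1 can be checked by simulation. The sketch below draws $X_i = \sqrt{p}\,U + \sqrt{1-p}\,U_i$ and estimates $\mathrm{Cov}(X_1, X_2)$ by Monte Carlo; the function names, the seed and the number of trials are arbitrary choices for this illustration, not part of the paper.

```python
import math
import random


def sample_correlated(k, p, rng):
    """Draw (X_1, ..., X_k) via X_i = sqrt(p) * U + sqrt(1 - p) * U_i."""
    u = rng.gauss(0.0, 1.0)  # shared factor U
    return [math.sqrt(p) * u + math.sqrt(1.0 - p) * rng.gauss(0.0, 1.0)
            for _ in range(k)]


def empirical_covariance(k, p, trials, seed=0):
    """Monte Carlo estimate of E[X_1 X_2]; means are zero, so this is Cov(X_1, X_2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = sample_correlated(k, p, rng)
        total += x[0] * x[1]
    return total / trials


if __name__ == "__main__":
    print(empirical_covariance(k=3, p=0.4, trials=20000))  # should be close to 0.4
```

Conditionally on the shared factor $U = u$, the coordinates are independent with mean $\sqrt{p}\,u$ and variance $1 - p$, which is the observation used in the lower-bound argument of Section 2.3.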

Thus, given $U$, the problem becomes that of detecting a subset of variables
with nonzero mean (equal to $\sqrt{p}\, U$) and with a variance equal to
$1 - p$ (instead of 1). This simple observation will be very useful to us later
on. When $U$ is random, the setting is similar to that of detecting a Gaussian
process (here equal to $\sqrt{p}\, U$ for $i \in S$, and equal to 0 otherwise)
in additive
Gaussian noise. However, the typical setting assumes that the Gaussian 
process affects all parts of the signal [20] . In our setting, the signal (the subset 
of correlated variables) will be sparse. Since we only have one instance of the 
signal X, the problem cannot be considered from the perspective of either 
multivariate statistics or multivariate time series. If indeed we had multiple 
copies of X, we could draw inspiration from the literature on the estimation 
of sparse correlation matrices [9, 12], from the literature on multivariate 
time series [23], or on other approaches [15]; but this is not the case as 
we only observe X. Closer in spirit to our goal of detecting correlations 
in a single vector of observation is the paper of [3], which aims at testing 
whether a Gaussian random field is i.i.d. or has some Markov dependency 
structure. Their setting models communication networks and is not directly 
related to ours. 

It transpires, therefore, that $p$ in the detection-of-correlations setting
plays a role analogous to $\mu^2$ in the detection-of-means setting. While this
is true to a certain extent, the picture is quite a bit more subtle. The
detection-of-means problem for parametric classes such as $k$-intervals is well
understood. In such cases, $\mu^2$ needs to be of order at least
$(1/k) \log(n/k)$ for reliable detection of $k$-intervals to be possible. This
remains true in the detection-of-correlations setting, and the generalized
likelihood ratio test (GLRT) is near-optimal, just as in the detection-of-means
problem; see, for example, [6].

Our inspiration for considering $k$-sets comes from the line of research on
the detection of sparse Gaussian mixtures. Very precise results are known
on $(n, k, \mu)$ that make detection possible [7, 18, 19] and optimal tests
have been developed, such as the "higher criticism" [16, 17]. In fact, the
recent paper [11] deals with heteroscedastic instances of the
detection-of-means problem where the variance of the anomalous variables may be
different from 1. For example, it is known that, when $n = O(k^2)$ [resp.,
$k^2 = o(n)$], $\mu^2$ needs to be of order at least $n/k^2$ [resp., $\log(n)$]
for reliable detection of $k$-sets to be possible, and the test based on
$\sum_i X_i$ (resp., $\max_i X_i$) is near-optimal. Though more precise results
are available when $k^2 = o(n)$, these cannot be translated immediately to our
case via the representation theorem of Lemma 1.1. As a bonus, we show that the
GLRT is clearly suboptimal in some regimes; see Theorem 3.1. Note that in the
detection-of-means problem it is not known whether the GLRT has any power.

1.3. Contribution and content of the paper. This paper contains a col- 
lection of positive and negative results about the detection-of-correlation 
problem described above. In Section 2 we derive lower bounds for the Bayes 
risk. The usual route of bounding the variance of the likelihood ratio, that is 
very successful in the detection-of-means problem, leads essentially nowhere 
in our case. Instead, we develop a new approach based on Lemma 1.1. We es- 
tablish a general lower bound for the Bayes risk in terms of the moment gen- 
erating function of the size of the overlap of two randomly chosen elements of 
the class C. This quantity also plays a crucial role in the detection-of-means 
setting and we are able to use inequalities worked out in the literature in 
various examples. In Section 3 we study the performance of some simple 
and natural tests such as the squared-sum test, based on $(\sum_i X_i)^2$, the
generalized likelihood ratio test (GLRT) and a goodness-of-fit (GOF) test,
as well as some variants. We show that, in the case of parametric classes
such as $k$-intervals and $k$-hypercubes, the GLRT is essentially optimal. The
squared-sum test is shown to be essentially optimal in the case of $k$-sets
when $k^2/n$ is large, while the GLRT is clearly suboptimal in this regime.
This is an interesting example where the GLRT fails miserably. When $k^2/n$
is small, detection is only possible when $p$ is very close to 1. We show that
a simple GOF test is near-optimal in this case. The analysis of tests such 
as the squared-sum test and the GLRT involves handling quadratic forms 
in X . This is technically more challenging than the analogous problem for 
the detection-of-means setting in which only linear functions of X appear 
(which are normal random variables). 

2. Lower bounds. In this section we investigate lower bounds on the 
risk, which are sometimes called information bounds. First we consider the 
special case when $\mathcal{C}$ contains only one element as this example will
serve as a benchmark for other examples. Then we consider the standard method
based on bounding the variance of the likelihood ratio under the null
hypothesis, and show that it leads nowhere. We then develop a new bound
based on Lemma 1.1 that has powerful implications, leading to fairly sharp 
bounds in a number of examples. 

2.1. The case N = 1. As a warm-up, and to gain insight into the problem,
consider first the simplest case where $\mathcal{C}$ contains just one set, say
$S = \{1, \ldots, k\}$. In this case, the alternative hypothesis is simple and
the likelihood ratio (Neyman-Pearson) test may be expressed by

\[
f^*(X) = 0 \quad \text{if and only if} \quad X^T (I - A_S^{-1}) X \le \log \det(A_S).
\]

This follows from the fact that $\mathbb{E}_0 Z_S = \sqrt{\det(A_S)}$, which is
easy to check by straightforward calculation.

The next simple lemma helps understand the behavior of the Bayes risk. 

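Lemma 2.1. Under $\mathbb{P}_0$, $X^T (I - A_S^{-1}) X$ is distributed as

\[
-\frac{p}{1 - p}\, \chi^2_{k-1} + \frac{p(k - 1)}{1 + p(k - 1)}\, \chi^2_1,
\]

and under the alternative $\mathbb{P}_S$, it has the same distribution as

\[
-p\, \chi^2_{k-1} + p(k - 1)\, \chi^2_1,
\]

where $\chi^2_1$ and $\chi^2_{k-1}$ denote independent chi-squared random
variables with degrees of freedom 1 and $k - 1$, respectively.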

Proof. If $Y = (Y_1, \ldots, Y_n)$ denotes a standard normal vector, then
under $H_0$, the quadratic form $X^T (I - A_S^{-1}) X$ is distributed as
$Y^T (I - A_S^{-1}) Y$, and under the alternative, it has the distribution of
$Y^T (A_S - I) Y$, since $X$ is distributed as $A_S^{1/2} Y$.

Now, observe that for any symmetric matrix $B$ with eigenvalues
$\lambda_1, \ldots, \lambda_n$, the quadratic form $Y^T B Y$ has distribution

\[
Y^T B Y \sim \sum_{i=1}^{n} \lambda_i Y_i^2.
\tag{2.1}
\]

This follows simply by diagonalizing $B$ and using the rotational invariance
of the standard normal distribution.

The lemma follows from this simple representation and the fact that $A_S$
has eigenvalue $1 - p$ with multiplicity $k - 1$, $1 + p(k - 1)$ with
multiplicity 1, and the eigenvalue 1 with multiplicity $n - k$. □
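
The eigen-structure of $A_S$ used in this proof can be verified directly on small examples. The following sketch applies $A_S$ to a vector without forming the matrix and checks two eigenvectors; the helper names and the 0-based indexing are assumptions of this illustration.

```python
def apply_A(x, S, p):
    """Compute A_S x for A_S as in (1.1), with 0-based indices in S."""
    S = set(S)
    total = sum(x[i] for i in S)
    # Row i in S: x_i + p * (sum of the other coordinates in S); otherwise x_i.
    return [x[i] + p * (total - x[i]) if i in S else x[i]
            for i in range(len(x))]


def is_eigenvector(x, lam, S, p, tol=1e-12):
    """Check A_S x = lam * x coordinate-wise up to a tolerance."""
    return all(abs(a - lam * b) <= tol for a, b in zip(apply_A(x, S, p), x))


if __name__ == "__main__":
    n, S, p = 6, [0, 1, 2, 3], 0.3
    k = len(S)
    ones_on_S = [1.0 if i in S else 0.0 for i in range(n)]
    contrast = [1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
    print(is_eigenvector(ones_on_S, 1 + p * (k - 1), S, p))  # eigenvalue 1 + p(k-1)
    print(is_eigenvector(contrast, 1 - p, S, p))             # eigenvalue 1 - p
```

The indicator of $S$ carries the eigenvalue $1 + p(k-1)$, while any vector supported on $S$ with zero sum carries $1 - p$; together with the $n - k$ coordinates outside $S$ this accounts for all $n$ eigenvalues.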



Now it is straightforward to analyze the Bayes risk. In particular, we
immediately have the following:

Proposition 2.1. If $\mathcal{C}$ is a singleton, $\lim_{k \to \infty} R^* = 0$
if and only if $pk \to \infty$. Similarly, $\lim_{k \to \infty} R^* = 1$ if and
only if $pk \to 0$.

Proof. Suppose $pk \to \infty$. It suffices to show that there exists a
threshold $\tau_k$ such that
$\mathbb{P}_0\{X^T (I - A_S^{-1}) X > \tau_k\} \to 0$ and
$\mathbb{P}_S\{X^T (I - A_S^{-1}) X < \tau_k\} \to 0$. We use Lemma 2.1 and the
fact that, by Chebyshev's inequality,

\[
\mathbb{P}\{|\chi^2_k - k| > t_k \sqrt{k}\} \to 0, \qquad k \to \infty,
\]

for any sequence $t_k \to \infty$, and the fact that

\[
\mathbb{P}\{t_k^{-1} < \chi^2_1 < t_k\} \to 1 \qquad \text{as } k \to \infty.
\]

We choose $t_k = \log k$ and define $\tau_k := -pk + p t_k \sqrt{k} + t_k$.
Then under the null,

\[
\mathbb{P}_0\{X^T (I - A_S^{-1}) X > \tau_k\} \to 0,
\]

and under the alternative, setting
$\eta_k := -pk - p t_k \sqrt{k} + pk/t_k$,

\[
\mathbb{P}_S\{X^T (I - A_S^{-1}) X < \eta_k\} \to 0.
\]

We then conclude with the fact that, for $k$ large enough, $\tau_k < \eta_k$.

If pk is bounded, the densities of the test statistic under both hypotheses 
have a significant overlap and the risk cannot converge to 0. 

The proof of the second statement is similar. □ 

Clearly, the role of n is immaterial in this specific example as the optimal 
test ignores all components whose indices are not in S = {1, . . . ,k}. 

2.2. The moment method. When the class C contains more than one 
element, the likelihood ratio with uniform prior on C is given by (1.3). 
A common approach for deriving a lower bound on the Bayes risk is via 
an upper bound on the variance of $L(X)$ under the null. Indeed, by the
Cauchy-Schwarz inequality,

\[
R^* = 1 - \frac{\mathbb{E}_0 |L(X) - 1|}{2}
    \ge 1 - \frac{\sqrt{\mathbb{E}_0 [L(X)^2] - 1}}{2}.
\]

Therefore, an upper bound on
$\mathbb{E}_0 [L(X)^2] - 1 = \mathrm{Var}_0(L(X))$ leads to a lower bound on
$R^*$.

Let $\Delta = \det(A_S) = (1 - p)^{k-1} (1 + p(k - 1))$, which is independent
of $S \in \mathcal{C}$. By Fubini's theorem, we have

\[
\mathbb{E}_0 L(X)^2 = \frac{1}{N^2 \Delta} \sum_{S, S' \in \mathcal{C}} \mathbb{E}_0 (Z_S Z_{S'}),
\]

where $Z_S$ is defined in (1.2). We focus on terms of the double sum for which
$S = S'$.

The following result is a straightforward consequence of the representation
(2.1) and the well-known expression for the moment generating function
of $\chi^2_1$.

Lemma 2.2. Suppose $X$ is a standard normal vector in $\mathbb{R}^n$ and $M$
is an $n \times n$ symmetric matrix with eigenvalues strictly less than 1/2.
Then

\[
\mathbb{E} \exp(X^T M X) = \det(I - 2M)^{-1/2}.
\]

If $M$ has an eigenvalue exceeding 1/2, then
$\mathbb{E} \exp(X^T M X) = +\infty$.

Since $M := I - A_S^{-1}$ has eigenvalue $-p/(1 - p)$ with multiplicity
$k - 1$, eigenvalue $p(k - 1)/(1 + p(k - 1))$ with multiplicity 1, and
eigenvalue 0 with multiplicity $n - k$,
$\mathbb{E}_0 [Z_S^2] = \mathbb{E}_0 \exp(X^T M X) = +\infty$ unless
$p(k - 1) < 1$. The implications are rather insubstantial. It only shows that,
when $p(k - 1) \le 1 - \varepsilon$ with $\varepsilon > 0$ fixed, the Bayes risk
does not tend to zero. As we shall see, this lower bound is grossly suboptimal,
except in the case where $\mathcal{C}$ is a singleton (as in Section 2.1) or
does not grow in size with $n$.
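
To see how quickly the second moment blows up, one can evaluate Lemma 2.2 on the eigenvalues of $M = I - A_S^{-1}$ listed above. The sketch below is an illustration only; the function name `second_moment_ZS` is hypothetical, and the boundary case $p(k-1) = 1$ is treated as infinite because the determinant in Lemma 2.2 vanishes there.

```python
import math


def second_moment_ZS(k, p):
    """E_0[Z_S^2] = det(I - 2M)^(-1/2) for M = I - A_S^{-1}, or +inf.

    Uses the eigenvalues of M: -p/(1-p) with multiplicity k-1 and
    p(k-1)/(1+p(k-1)) with multiplicity 1 (the zero eigenvalues contribute 1).
    Assumes 0 < p < 1.
    """
    if p * (k - 1) >= 1:
        return math.inf  # largest eigenvalue of M is at least 1/2
    lam_small = -p / (1.0 - p)
    lam_big = p * (k - 1) / (1.0 + p * (k - 1))
    det = (1.0 - 2.0 * lam_small) ** (k - 1) * (1.0 - 2.0 * lam_big)
    return det ** -0.5


if __name__ == "__main__":
    print(second_moment_ZS(10, 0.05))  # finite, since p(k-1) = 0.45 < 1
    print(second_moment_ZS(10, 0.2))   # infinite, since p(k-1) = 1.8 >= 1
```

The finiteness condition depends only on $p(k-1)$ and not on the size of the class, which is why this route cannot capture the role of $N$.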

A refinement of this method consists in bounding the first and second 
truncated moments of L(X), again under the null hypothesis. For example, 
this is the approach used in [11, 18] in the detection-of-means setting for
the case of $k$-sets to obtain sharp bounds. Unfortunately, in our case this
method only provides a useful bound when the class $\mathcal{C}$ is not too
large (i.e., has size polynomial in $k$) while it does not seem to lead
anywhere in the case of $k$-sets. The computations are quite involved and we do
not provide details here, as we were able to obtain a more powerful general
bound that applies to both $k$-intervals and $k$-sets. This is presented in the
next section.

2.3. A general lower bound. In this section we derive a general lower 
bound for the Bayes risk. As in the detection-of-means problem [2, 4, 5], 
the relevant measure of complexity is in terms of the moment generating 
function of the size of the overlap of two randomly chosen elements of C. 
In the detection-of-means setting, this is a consequence of bounding the 
variance of the likelihood ratio. We saw in Section 2.2 that this method is 
useless here. Instead, we make a connection between the two problems using 
Lemma 1.1. 

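Theorem 2.1. For any class $\mathcal{C}$ and any $a > 0$,

\[
R^* \ge \mathbb{P}\{|\mathcal{N}(0, 1)| \le a\}
\Bigl(1 - \tfrac{1}{2} \sqrt{\mathbb{E} \exp(\nu_a Z) - 1}\Bigr),
\]

where $\nu_a := p a^2/(1 + p) - \tfrac{1}{2} \log(1 - p^2)$ and
$Z = |S \cap S'|$, with $S, S'$ drawn independently, uniformly at random from
$\mathcal{C}$. In particular, taking $a = 1$,

\[
R^* \ge 0.6 - 0.3 \sqrt{\mathbb{E} \exp(\nu(p) Z) - 1},
\]

where $\nu(p) := \nu_1 = p/(1 + p) - \tfrac{1}{2} \log(1 - p^2)$.

Proof. The starting point of the proof is Lemma 1.1, which enables us to
represent the vector $X$ as

\[
X_i =
\begin{cases}
U_i, & i \notin S, \\
\sqrt{p}\, U + \sqrt{1 - p}\, U_i, & i \in S,
\end{cases}
\]

where $U, U_1, \ldots, U_n$ are independent standard normal random variables.
(In fact, we only need to assume that $X$ is as described in distribution.)

We consider now the alternative $H_1(u)$, defined as the alternative $H_1$
given $U = u$. Let $R(f)$, $L$, $f^*$ [resp., $R_u(f)$, $L_u$, $f_u^*$] be the
risk of a test $f$, the likelihood ratio, and the optimal (likelihood ratio)
test, for $H_0$ versus $H_1$ [resp., $H_0$ versus $H_1(u)$]. For any
$u \in \mathbb{R}$, $R_u(f_u^*) \le R_u(f^*)$, by the optimality of $f_u^*$ for
$H_0$ versus $H_1(u)$. Therefore, conditioning on $U$,

\[
R^* = R(f^*) = \mathbb{E}_U R_U(f^*) \ge \mathbb{E}_U R_U(f_U^*)
    = 1 - \tfrac{1}{2} \mathbb{E}_U \mathbb{E}_0 |L_U(X) - 1|.
\]

[$\mathbb{E}_U$ is the expectation with respect to $U \sim \mathcal{N}(0, 1)$.]
Using the fact that $\mathbb{E}_0 |L_u(X) - 1| \le 2$ for all $u$, we have

\[
\mathbb{E}_U \mathbb{E}_0 |L_U(X) - 1|
\le 2\, \mathbb{P}\{|U| > a\}
  + \mathbb{P}\{|U| \le a\} \max_{u \in [-a, a]} \mathbb{E}_0 |L_u(X) - 1|,
\]

and therefore, using the Cauchy-Schwarz inequality,

\[
1 - \tfrac{1}{2} \mathbb{E}_U \mathbb{E}_0 |L_U(X) - 1|
\ge \mathbb{P}\{|U| \le a\} \Bigl(1 - \tfrac{1}{2} \max_{u \in [-a, a]} \mathbb{E}_0 |L_u(X) - 1|\Bigr)
\ge \mathbb{P}\{|U| \le a\} \Bigl(1 - \tfrac{1}{2} \max_{u \in [-a, a]} \sqrt{\mathbb{E}_0 L_u^2(X) - 1}\Bigr).
\]

Since, under $H_1(u)$, the coordinates $X_i$ with $i \in S$ are independent
$\mathcal{N}(\sqrt{p}\, u, 1 - p)$,

\[
L_u(X) = \frac{1}{N} \sum_{S \in \mathcal{C}} \frac{1}{(1 - p)^{k/2}}
\exp\Bigl( \sum_{i \in S} \Bigl[ \frac{X_i^2}{2} - \frac{(X_i - \sqrt{p}\, u)^2}{2(1 - p)} \Bigr] \Bigr),
\]

we get

\[
\mathbb{E}_0 L_u^2(X) = \frac{1}{N^2} \sum_{S, S' \in \mathcal{C}} \frac{1}{(1 - p)^{k}}\,
\mathbb{E}_0 \exp\Bigl( \sum_{i \in S \cap S'} \Bigl[ X_i^2 - \frac{(X_i - \sqrt{p}\, u)^2}{1 - p} \Bigr]
+ \sum_{i \in S \triangle S'} \Bigl[ \frac{X_i^2}{2} - \frac{(X_i - \sqrt{p}\, u)^2}{2(1 - p)} \Bigr] \Bigr).
\]

It is easy to check that

\[
\frac{x^2}{2} - \frac{(x - \sqrt{p}\, u)^2}{1 - p}
= \frac{p u^2}{1 + p} - \frac{1 + p}{2(1 - p)} \Bigl( x - \frac{2 \sqrt{p}\, u}{1 + p} \Bigr)^2,
\]

so that each index in $S \cap S'$ contributes a factor
$\exp(p u^2/(1 + p)) \sqrt{(1 - p)/(1 + p)}$ to the expectation, while each
index in $S \triangle S'$ contributes a factor $\sqrt{1 - p}$. Since
$|S \triangle S'| = 2(k - |S \cap S'|)$, this implies

\[
\mathbb{E}_0 L_u^2(X) = \frac{1}{N^2} \sum_{S, S' \in \mathcal{C}} \frac{1}{(1 - p)^{k}}
\exp\Bigl( \frac{p u^2}{1 + p} |S \cap S'| \Bigr)
\Bigl( \frac{1 - p}{1 + p} \Bigr)^{|S \cap S'|/2} (1 - p)^{k - |S \cap S'|}
= \mathbb{E} \exp(\nu_u Z),
\]

where $\nu_u = p u^2/(1 + p) - \tfrac{1}{2} \log(1 - p^2)$. Since
$\nu_u \le \nu_a$ for $|u| \le a$, this concludes the proof. □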

We now apply Theorem 2.1 to a few examples. The theorem converts the 
problem into a purely combinatorial question and [2] offers various estimates 
for the moment generating function of Z which we may use for our purposes. 
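
Once the distribution of the overlap $Z$ is known, the second bound of Theorem 2.1 is a one-line computation. The sketch below evaluates $0.6 - 0.3\sqrt{\mathbb{E}\exp(\nu(p) Z) - 1}$ from a probability mass function given as a dictionary; the function names and the dictionary representation are assumptions of this illustration, and the result is clipped at 0 since a risk is nonnegative.

```python
import math


def nu(p, a=1.0):
    """nu_a = p a^2 / (1 + p) - (1/2) log(1 - p^2), for 0 < p < 1."""
    return p * a * a / (1.0 + p) - 0.5 * math.log(1.0 - p * p)


def bayes_risk_lower_bound(overlap_pmf, p):
    """Evaluate 0.6 - 0.3 * sqrt(E exp(nu(p) Z) - 1), clipped at 0.

    overlap_pmf maps each value z of Z = |S ∩ S'| to its probability.
    """
    v = nu(p)
    mgf = sum(prob * math.exp(v * z) for z, prob in overlap_pmf.items())
    return max(0.0, 0.6 - 0.3 * math.sqrt(max(mgf - 1.0, 0.0)))


def disjoint_overlap_pmf(N, k):
    """Overlap distribution for N disjoint sets of size k: Z = k w.p. 1/N, else 0."""
    return {0: 1.0 - 1.0 / N, k: 1.0 / N}


if __name__ == "__main__":
    print(bayes_risk_lower_bound(disjoint_overlap_pmf(N=1000, k=10), p=0.3))
```

For the disjoint-set example, $\nu(0.3) \approx 0.278$ is below $\log(1000)/10 \approx 0.691$, so the computed bound stays above 0.3.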

2.3.1. Nonoverlapping sets. Consider first the simplest case when
$\mathcal{C}$ contains $N$ disjoint sets of size $k$.

Corollary 2.1. Let $\mathcal{C}$ be a class of $N$ nonoverlapping sets of size
$k$. If

\[
\nu(p) \le \frac{\log(N)}{k},
\]

then the Bayes risk satisfies $R^* \ge 0.3$, and $R^* \to 1$ if
$p \ll \min(1, \log(N)/k)$ or $(1 - p) N^{2/k} \to \infty$.

Proof. Clearly, the size $Z$ of the overlap of two randomly chosen elements of
$\mathcal{C}$ equals zero with probability $1 - 1/N$ and $k$ with probability
$1/N$. Thus,

\[
\mathbb{E} e^{\nu Z} - 1 = (1/N)(e^{\nu k} - 1) \le (1/N) e^{\nu k},
\]

which is bounded by 1 if $\nu \le \log(N)/k$. The first part then follows from
the second part of Theorem 2.1. For the second part, we need to find
$a \to \infty$ such that $\nu_a k - \log N \to -\infty$. (Note that in this
case the upper bound above tends to zero.) First assume that
$p \ll \min(1, \log(N)/k)$. In that case, $\nu_a \sim p a^2$, so it suffices to
take $a \to \infty$ slowly enough that $p a^2 \ll \min(1, \log(N)/k)$. Next
assume that $b := \log(1 - p) + 2 \log(N)/k \to \infty$. In this case, we have
$\nu_a \le a^2 - (1/2) \log(1 - p)$, and we simply choose $a \to \infty$ slowly
enough that $a^2 - b/2 \to -\infty$. □

2.3.2. k-intervals. Consider the class of all $k$-intervals. The situation is
similar to that of nonoverlapping sets. (In fact, since this class of
$k$-intervals contains $\lfloor n/k \rfloor$ nonoverlapping sets of size $k$,
we could immediately deduce a lower bound via Corollary 2.1.)

Corollary 2.2. Let $\mathcal{C}$ be the class of all $k$-intervals. If

\[
\nu(p) \le \frac{\log(n/(2k))}{k},
\]

then the Bayes risk satisfies $R^* \ge 0.3$, and $R^* \to 1$ if
$p \ll \min(1, \log(n/k)/k)$ or if $(1 - p)(n/k)^{2/k} \to \infty$.

Proof. For two $k$-intervals chosen independently and uniformly at random,

\[
\mathbb{P}\{|S \cap S'| = \ell\} \le \frac{2}{n} \qquad \forall \ell = 1, \ldots, k.
\]

Thus,

\[
\mathbb{E} e^{\nu Z} - 1 \le \frac{2}{n} \sum_{\ell = 1}^{k} e^{\nu \ell} \le \frac{2k}{n} e^{\nu k},
\]

and we proceed as in the proof of Corollary 2.1, using the fact that
$N \le n$. □
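
The overlap distribution for $k$-intervals can be computed exactly by brute force, which gives a quick check of the $2/n$ bound used in this proof. The sketch below fixes one interval and slides the other, which is enough by translation invariance; the function name and the 0-based indexing are choices made for this illustration.

```python
from collections import Counter
from fractions import Fraction


def interval_overlap_pmf(n, k):
    """Exact distribution of |S ∩ S'| for two uniform k-intervals modulo n.

    By translation invariance it suffices to fix S = {0, ..., k-1} and let
    S' range over the n possible starting points.
    """
    base = set(range(k))
    counts = Counter(
        len(base & {(start + j) % n for j in range(k)}) for start in range(n)
    )
    return {z: Fraction(c, n) for z, c in counts.items()}


if __name__ == "__main__":
    for z, prob in sorted(interval_overlap_pmf(12, 3).items()):
        print(z, prob)
```

For $n \ge 2k$, each overlap size $\ell = 1, \ldots, k-1$ arises from exactly two shifts and $\ell = k$ from one, which is where the bound $2/n$ comes from.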
2.3.3. k-sets. Consider the class of all sets of size $k$.

Corollary 2.3. Let $\mathcal{C}$ be the class of $k$-sets. If

\[
\frac{k^2}{n} \le \frac{\ln 2}{\exp(\nu(p)) - 1},
\]

then the Bayes risk satisfies $R^* \ge 0.3$, and $R^* \to 1$ if either
$k^2/n \to \infty$ and $p k^2/n \to 0$, or $(1 - p) n^2/k^4 \to \infty$.

Proof. By [2], Proposition 3.4, which uses negative association,

\[
\mathbb{E} e^{\nu Z} \le \Bigl( (e^{\nu} - 1) \frac{k}{n} + 1 \Bigr)^{k}
\le \exp\Bigl( (e^{\nu} - 1) \frac{k^2}{n} \Bigr),
\]

where the last expression is bounded by 2 under the postulated condition,
and tends to 1 if either $k^2/n \to \infty$ and $\nu k^2/n \to 0$, or
$k^2/n \to 0$ and $e^{\nu} k^2/n \to 0$. First assume that $k^2/n \to \infty$
and $p k^2/n \to 0$. By choosing $a \to \infty$ slowly enough that
$p a^2 k^2/n \to 0$ we ensure that $\nu_a k^2/n \to 0$. Next assume that
$b := \log(1 - p) - 2 \log(k^2/n) \to \infty$. Since
$\nu_a \le a^2 - (1/2) \log(1 - p)$, it suffices to take $a \to \infty$ slowly
enough that $a^2 - b/2 \to -\infty$ to ensure that $e^{\nu_a} k^2/n \to 0$.
The result then follows from Theorem 2.1. □
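
For $k$-sets the overlap $Z$ of two independent uniform sets is hypergeometric, so the moment generating function can be computed exactly and compared with the two upper bounds used in this proof. The following sketch does this for given $(n, k, \nu)$; the function names are hypothetical and `math.comb` requires Python 3.8 or later.

```python
import math


def kset_overlap_mgf(n, k, v):
    """Exact E exp(v Z) where Z = |S ∩ S'| for two uniform k-subsets of n items.

    Z is hypergeometric: P{Z = z} = C(k, z) C(n - k, k - z) / C(n, k).
    """
    total = math.comb(n, k)
    return sum(
        math.comb(k, z) * math.comb(n - k, k - z) / total * math.exp(v * z)
        for z in range(k + 1)
    )


def kset_mgf_bounds(n, k, v):
    """The two upper bounds from the proof of Corollary 2.3."""
    first = ((math.exp(v) - 1.0) * k / n + 1.0) ** k
    second = math.exp((math.exp(v) - 1.0) * k * k / n)
    return first, second


if __name__ == "__main__":
    n, k, v = 50, 5, 0.4
    print(kset_overlap_mgf(n, k, v), kset_mgf_bounds(n, k, v))
```

The first bound is the moment generating function of a binomial variable with $k$ trials and success probability $k/n$, which dominates the hypergeometric one; the second follows from $1 + x \le e^x$.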

2.3.4. Perfect matchings. Consider now the example of perfect matchings
described in the Introduction. Here $k = \sqrt{n}$. Once again, Theorem 2.1
applies and implies that testing is impossible for moderate values of $p$.

Corollary 2.4. Let $\mathcal{C}$ be the class of all perfect matchings. If
$p \le 1/2$, the Bayes risk satisfies $R^* \ge 0.3$. Also, $R^* \to 1$ if
$p \to 0$.

Proof. The random variable $Z$ for this class is considered by [2], who prove
an upper bound on $\mathbb{E} e^{\nu Z}$. That bound is at most 2 whenever
$\nu \le 1 + \ln \ln 2$, which is satisfied whenever $p \le 1/2$, and it tends
to 1 if $\nu \to 0$. We then apply Theorem 2.1. □

2.3.5. Spanning trees. A similar argument applies for the class of all
spanning trees of a complete graph with $k + 1$ vertices [and
$n = (k + 1)k/2$ edges] as described in the Introduction.

Corollary 2.5. Let $\mathcal{C}$ be the class of all spanning trees. If
$p \le 0.4$, then the Bayes risk satisfies $R^* \ge 0.15$. We also have
$R^* \to 1$ if $p \to 0$.

Proof. It is shown in [2] that $\mathbb{E} e^{\nu Z}$ admits an upper bound
which is at most 13/4 whenever $\nu \le 1 + \ln((\ln(13/4))/2)$, which is
satisfied whenever $p \le 0.4$, and which tends to 1 if $\nu \to 0$. We then
apply Theorem 2.1.

□ 



3. Some near-optimal tests. We already know that the likelihood ratio 
test is optimal in the Bayesian setting. We study here other tests for mul- 
tiple reasons. First, the likelihood ratio test seems difficult to compute in 
most situations. Second, the likelihood ratio test is heavily dependent on the 
prior we choose — here, the uniform distribution on the class. The third, and 
perhaps most important, reason is that it is difficult to obtain directly up- 
per bounds for the (worst-case) risk of the likelihood ratio test whereas the 
tests considered below are easier to analyze and often yield near-optimal 
performance. Whenever we obtain an upper bound for the risk of a test 
that matches the lower bounds developed in the previous section, we have 
a full understanding of the limitations and possibilities of detection for the 
particular case considered, and this is our main goal in this paper. 

We consider the squared-sum test, which corresponds to the ANOVA 
test in the detection-of-means setting, the generalized likelihood ratio test 
(GLRT) and a goodness-of-fit (GOF) test, as well as some variants. We say 
that a test is near-optimal for a certain setting if it achieves the information 
bound for that setting to first order. 

3.1. The squared-sum test. One of the simplest tests is based on the
observation that the magnitude of the squared-sum $(\sum_{i=1}^{n} X_i)^2$ may
be substantially different under the null and alternative hypotheses due to the
higher correlation under the latter.

Indeed, under $\mathbb{P}_0$, $(\sum_{i=1}^{n} X_i)^2$ is distributed as
$n \chi^2_1$, while for any $S \subset \{1, \ldots, n\}$ with $|S| = k$, under
$\mathbb{P}_S$, $(\sum_{i=1}^{n} X_i)^2$ has the same distribution as
$(n + p k(k - 1)) \chi^2_1$; in fact, under the more general correlation model
(1.4), this is a (stochastic) lower bound. This immediately leads to the
following result.

Proposition 3.1. Let $\mathcal{C}$ be an arbitrary class of sets of size $k$
and suppose that $p k^2/n \to \infty$ in (1.4). If $t_n$ is such that
$t_n \to \infty$ but $t_n = o(p k^2/n)$, then the test which rejects the null
hypothesis if $(\sum_{i=1}^{n} X_i)^2 > n t_n$ has a worst-case risk converging
to zero. However, any test based on $(\sum_{i=1}^{n} X_i)^2$ is powerless if
$p k^2/n \to 0$ in (1.1).

In Corollary 2.3, we saw that reliable detection of $k$-sets is impossible if
$k^2/n \to \infty$ and $p k^2/n \to 0$. Here we see that, when
$p k^2/n \to \infty$, the squared-sum test is asymptotically powerful. Hence,
the following statement:

The squared-sum test is near-optimal for detecting k-sets in the regime
$k^2/n \to \infty$.

On the other hand, in the regime $k^2/n \to 0$, the squared-sum test is
powerless even if $p = 1$. The test does not require knowledge of $p$, though
knowing $p$ allows one to choose the threshold $t_n$ in an optimal fashion; if
$p$ is unknown, we simply choose $t_n \to \infty$ very slowly.
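
A small Monte Carlo experiment makes the behavior of the squared-sum test tangible. The sketch below is illustrative only: the function names, the particular values of $n$, $k$, $p$ and $t_n$, and the use of Lemma 1.1 to simulate the alternative are all choices made for this example.

```python
import math
import random


def squared_sum_test(x, t_n):
    """Return 1 (reject H0) when (sum_i x_i)^2 > n * t_n, else 0."""
    return 1 if sum(x) ** 2 > len(x) * t_n else 0


def sample(n, S, p, rng):
    """Draw X with correlation p inside S (via Lemma 1.1) and i.i.d. N(0,1) outside."""
    u = rng.gauss(0.0, 1.0)
    a, b = math.sqrt(p), math.sqrt(1.0 - p)
    return [a * u + b * rng.gauss(0.0, 1.0) if i in S else rng.gauss(0.0, 1.0)
            for i in range(n)]


def empirical_errors(n, k, p, t_n, trials, seed=0):
    """Estimate (type I error, type II error) for S = {0, ..., k-1}."""
    rng = random.Random(seed)
    S = set(range(k))
    type1 = sum(squared_sum_test(sample(n, set(), 0.0, rng), t_n)
                for _ in range(trials)) / trials
    type2 = sum(1 - squared_sum_test(sample(n, S, p, rng), t_n)
                for _ in range(trials)) / trials
    return type1, type2


if __name__ == "__main__":
    print(empirical_errors(n=200, k=150, p=0.9, t_n=6.0, trials=1000))
```

Because $\chi^2_1$ puts substantial mass near zero, the type II error decays slowly as $p k^2/n$ grows, which is consistent with the requirement that $t_n = o(p k^2/n)$.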

3.2. The generalized likelihood ratio test. In this section we investigate 
the performance of the generalized likelihood ratio test (GLRT). We show 
that for parametric classes such as $k$-intervals, the test is near-optimal.
However, for the nonparametric class of $k$-sets, the test performs poorly in
some regimes.

By definition, the GLRT rejects for large values of
$\max_{S \in \mathcal{C}} Z_S/\mathbb{E}_0 Z_S$, or simply
$\max_{S \in \mathcal{C}} Z_S$ when all the sets in the class $\mathcal{C}$ are
of the same size, since $\mathbb{E}_0 Z_S$ only depends on the size of $S$.
Hence, the GLRT is of the form

\[
f(X) = 0 \quad \text{if and only if} \quad \max_{S \in \mathcal{C}} X^T (I - A_S^{-1}) X \le t
\]

for some appropriately chosen $t$. We immediately notice that the GLRT
requires knowledge of $p$.

Our analysis of the GLRT is based on Lemma 2.1, which provides the
distribution of the quadratic form $X^T (I - A_S^{-1}) X$ under the null
$\mathbb{P}_0$ and under the alternative $\mathbb{P}_S$. Under the null we need
to control the maximum of such quadratic forms over $S \in \mathcal{C}$, which
we do using exponential concentration inequalities for chi-squared
distributions.

3.2.1. The GLRT for k-intervals and other parametric classes. Recalling
Corollary 2.2, when detecting $k$-intervals all tests are asymptotically
powerless when $p \ll \min(1, \log(n/k)/k)$. We assume for concreteness that
$k/\log n \to \infty$, for otherwise detecting $k$-intervals for very small $k$
has more to do with detecting $k$-sets. We state a general result that applies
for classes of small cardinality.

Proposition 3.2. Consider a class $\mathcal{C}$ of sets of size $k$, with
cardinality $N \to \infty$ such that $\log(N)/k \to 0$. When
$pk/\log N \to \infty$, the generalized likelihood ratio test with threshold
value $t = -pk + p \sqrt{5 k \log N} + 2 \log N$ has worst-case risk tending to
zero.

Proof. We first bound the probability of Type I error. Indeed, under 
the null, by Lemma 2.1 and its proof, we can decompose 




f{X) = if and only if maxX^ {I - Ay)X < t 



X^{I-A-')X 



P 



Cs + 



p{k-l) 



Ds, 



1-p 



l + p{k-l) 
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where Cs ~ xl-i and Ds ~ Xi- Hence, 

\[ \max_{S \in \mathcal{C}} X^T (I - A_S^{-1}) X \le -\rho \min_{S \in \mathcal{C}} C_S + \max_{S \in \mathcal{C}} D_S. \]

It is well known that the maximum of $N$ standard normals is bounded by
$\sqrt{2\log N}$ with probability tending to 1 as $N \to \infty$. Hence, the
second term on the right-hand side is bounded by $2\log N$ with high
probability. For the first term, we combine the union bound and Chernoff's
bound to obtain, for all $a < 1$,

\[ \mathbb{P}_0\Big\{\min_{S \in \mathcal{C}} C_S < a(k-1)\Big\} \le N\, \mathbb{P}\{\chi^2_{k-1} < a(k-1)\} \le N \exp\Big(-\frac{k-1}{2}\,(a - 1 - \log a)\Big). \tag{3.1} \]

Using the fact that $a - 1 - \log a \sim \frac{1}{2}(1-a)^2$ when $a \to 1$, the
right-hand side tends to zero when $a = 1 - \sqrt{(5/k)\log N}$. We arrive at
the conclusion that the GLRT with threshold
$t = -\rho k + \rho\sqrt{5k\log N} + 2\log N$ has probability of Type I error
tending to zero.

Now consider the alternative under $\mathbb{P}_S$. By Lemma 2.1 and
Chebyshev's inequality,

\[ X^T (I - A_S^{-1}) X \ge -\rho k - \rho s_k \sqrt{k} + \rho k / s_k \]

with high probability when $s_k \to \infty$. We then conclude by the fact that
the right-hand side is larger than $t$ when $s_k \to \infty$ sufficiently
slowly. □

Comparing the performance of the GLRT in Proposition 3.2 with the
lower bound for $k$-intervals in Corollary 2.2, we see that the GLRT is
near-optimal for detecting $k$-intervals. This is actually the case for all
parametric classes we know of.

3.2.2. The GLRT for k-sets and other nonparametric classes. Consider
now the example of the class of all $k$-sets. Compared to the previous section,
the situation here is different in that $N$, the size of the class
$\mathcal{C}$, is much larger. For example, for $k$-sets, $N = \binom{n}{k}$,
and therefore $\log(N)/k \to \infty$ with $n \to \infty$. The equivalent of
Proposition 3.2 for this regime is the following:
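As an illustration of how the GLRT statistic could be evaluated for $k$-intervals, here is a minimal Python sketch. It relies on the closed form $X^T(I - A_S^{-1})X = \frac{\rho}{(1-\rho)(1+\rho(k-1))}\big[(\sum_{i\in S}X_i)^2 - (1+\rho(k-1))\sum_{i\in S}X_i^2\big]$, which follows from the eigenstructure of the equicorrelation matrix. The function names and the restriction to non-wrapping intervals are assumptions of this sketch, not part of the paper.

```python
from itertools import accumulate


def glrt_quadratic_form(block, rho):
    """X_S^T (I - A_S^{-1}) X_S for one equicorrelated block of size k."""
    k = len(block)
    total = sum(block)
    squares = sum(v * v for v in block)
    scale = rho / ((1.0 - rho) * (1.0 + rho * (k - 1)))
    return scale * (total * total - (1.0 + rho * (k - 1)) * squares)


def glrt_statistic_intervals(x, k, rho):
    """Maximum of the quadratic form over all (non-wrapping) k-intervals."""
    # Prefix sums of x and of x^2 give every window sum in constant time.
    prefix = [0.0] + list(accumulate(x))
    prefix_sq = [0.0] + list(accumulate(v * v for v in x))
    scale = rho / ((1.0 - rho) * (1.0 + rho * (k - 1)))
    best = float("-inf")
    for start in range(len(x) - k + 1):
        total = prefix[start + k] - prefix[start]
        squares = prefix_sq[start + k] - prefix_sq[start]
        value = scale * (total * total - (1.0 + rho * (k - 1)) * squares)
        best = max(best, value)
    return best
```

Note that `rho` is an explicit argument: the statistic cannot be formed without it, which is the practical drawback of the GLRT pointed out above.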

Proposition 3.3. Consider a class $\mathcal{C}$ of sets of size $k$, with
cardinality $N \to \infty$ such that $\log(N)/k \to \infty$. When
$\eta := (1-\rho) N^{2/k} (\log N)/k \to 0$, the generalized likelihood ratio
test with threshold value $t = -(\log N)/\sqrt{\eta}$ has worst-case risk
tending to zero.

Proof. We follow the proof of Proposition 3.2. The only difference is
in (3.1), where we now need $a \to 0$ and its right-hand side tends to zero
when $\log a + 2(\log N)/k \to -\infty$. Choose $a = N^{-2/k}\sqrt{\eta}$,
obtaining that, with high probability,

\[ \max_{S \in \mathcal{C}} X^T (I - A_S^{-1}) X \le -\frac{\rho}{1-\rho}\, N^{-2/k} k \sqrt{\eta} + 2\log N. \tag{3.2} \]

As before, with high probability under $\mathbb{P}_S$,

\[ X^T (I - A_S^{-1}) X \ge -\rho k, \tag{3.3} \]

so we only need to check that the threshold $t$ is larger than the right-hand
side in (3.2) and smaller than the right-hand side in (3.3), which is the case
by the assumptions we made. □

Notice that in Proposition 3.3 the condition on $\rho$ implies that
$\rho \to 1$, which is much stronger than what the squared-sum test requires
when $k^2/n \to \infty$. For $k$-sets, $N = \binom{n}{k}$, so that
$\log N = k\log(n/k) + O(k)$, and the requirement is that
$(1-\rho)(n/k)^2 \log(n/k) \to 0$, which is substantially stronger than
what the lower bound obtained in Corollary 2.3 requires. Moreover, if we
restrict $\rho$ to be bounded away from 1, then the GLRT may be powerless.

Theorem 3.1. Let $\mathcal{C}$ be the class of all $k$-sets. If $\rho < 0.6$
and $k = o(n^{0.7})$, the GLRT has a Bayes risk bounded away from zero.

The proof is in the Appendix.

In view of Theorem 3.1, the GLRT is clearly suboptimal in the situation
stated there, and compares very poorly with the squared-sum test, which is
asymptotically powerful if $\rho k^2/n \to \infty$ as seen in Proposition 3.1.
We do not know of any other situation where the GLRT fails so miserably.

3.3. A localized squared-sum test. While the GLRT is near-optimal for
detecting objects from a parametric class such as $k$-intervals, it needs
knowledge of $\rho$. However, a simple modification solves this drawback.
Indeed, consider the following "local" squared-sum test:

\[ f(X) = 0 \quad \text{if and only if} \quad \max_{S \in \mathcal{C}} \Big(\sum_{i \in S} X_i\Big)^2 \le t \]

for some appropriate threshold $t$.



Proposition 3.4. Consider a class $\mathcal{C}$ of sets of size $k$, with
cardinality $N \to \infty$ such that $\log(N)/k \to 0$. When
$\rho \gg \log(N)/k$ in (1.4), the local squared-sum test with threshold
$t = 2k\log N$ has worst-case risk tending to zero.

Proof. The proof is quite straightforward. Indeed, under the null, for
any $S$ of size $k$ we have $\sum_{i \in S} X_i \sim \mathcal{N}(0, k)$ so that

\[ \max_{S \in \mathcal{C}} \Big(\sum_{i \in S} X_i\Big)^2 \le t \]

with probability tending to 1. Under an alternative (1.4), $S$ denoting the
anomalous set of variables, we have

\[ \operatorname{Var}\Big(\sum_{i \in S} X_i\Big) \ge k + \rho k(k-1) \gg t \]

when $\rho \gg \log(N)/k$. □

Specializing this result to the case of $k$-intervals leads to the following
statement (which ignores logarithmic factors):

The localized squared-sum test is near-optimal for detecting
$k$-intervals in the regime where $\log(n)/k \to 0$.

When k is unknown. We might only know that some interval is anomalous,
without knowing the size of that interval. In that case, multiple testing at
each $k$ using the local squared-sum test yields adaptivity. Computationally,
this may be done effectively by computing sums in a multiscale fashion as
advocated in [6]. In fact, here it is enough to compute the sums over all
dyadic intervals, since each interval $S$ contains a dyadic interval of length
at least $|S|/4$, and this can be done in $3n$ flops in a recursive fashion.
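The recursive computation of dyadic sums can be sketched as follows in Python. This is only an illustration under stated assumptions: the input length is taken to be a power of two, and the function names `dyadic_sums` and `local_squared_sum_by_scale` are hypothetical. Each parent sum is obtained from its two children with a single addition, so the total work is linear in $n$.

```python
def dyadic_sums(x):
    """Sums over all dyadic intervals of x.

    levels[j][b] is the sum of x[b * 2**j : (b + 1) * 2**j].
    Assumes len(x) is a power of two (an assumption of this sketch).
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    levels = [list(x)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # One addition per parent interval: n - 1 additions overall.
        levels.append([prev[2 * b] + prev[2 * b + 1] for b in range(len(prev) // 2)])
    return levels


def local_squared_sum_by_scale(x):
    """Largest squared dyadic sum at each scale 2**j."""
    return {2 ** j: max(s * s for s in level)
            for j, level in enumerate(dyadic_sums(x))}
```

Each scale would then be compared with its own threshold, in the spirit of the multiple testing over $k$ described above; the choice of those thresholds is not covered by this sketch.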

3.4. A goodness-of-fit test. By now, the parametric case is essentially
solved, with the local squared-sum test being not only near-optimal but also
computable in polynomial time (in $n$ and $k$) for the case of $k$-intervals,
for example. In the nonparametric case, so far, the story is not complete. We
focus on the class of all $k$-sets. There we know that the squared-sum test
is near-optimal if $k^2/n \to \infty$. If $k^2/n \to 0$, it has no power, and
we only know that the GLRT works when $(1-\rho)(n/k)^2\log(n/k) \to 0$, which
does not match the rate obtained in Corollary 2.3. Worse than that, it is not
clear whether computing the GLRT is possible in time polynomial in $(n, k)$.
We now show that a simple goodness-of-fit (GOF) test performs (almost)
as desired.

The basic idea is the following. Let $H_i = \Phi(X_i)$, where $\Phi$ is the
standard normal distribution function. Under the null, the $H_i$'s are i.i.d.
uniform in $(0, 1)$. Under an alternative with anomalous set denoted by $S$,
the $X_i$, $i \in S$, are closer together, especially since we place ourselves
in the regime where $\rho \to 1$. More precisely, we have the following.

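Lemma 3.1. Suppose $X_1, \ldots, X_k$ are zero-mean, unit-variance random
variables satisfying $\operatorname{Cov}(X_i, X_j) \ge \rho > 0$ for all
$i \ne j$. Let $\bar{X}$ denote their average. Then for any $t > 0$,

\[ \mathbb{P}\big\{\#\{i : |X_i - \bar{X}| > t\} \ge k/2\big\} \le \frac{2(1-\rho)}{t^2}. \]

Proof. Let $\Lambda := \sum_{i \ne j} \operatorname{Cov}(X_i, X_j) \ge k(k-1)\rho$.
Elementary calculations show that

\[ \mathbb{E}\Big[\frac{1}{k}\sum_{i=1}^{k} (X_i - \bar{X})^2\Big] = 1 - \frac{1}{k} - \frac{\Lambda}{k^2} \le (1 - 1/k)(1 - \rho) \le 1 - \rho. \]

By Markov's inequality, we then have

\[ \mathbb{P}\Big\{\frac{1}{k}\sum_{i=1}^{k} (X_i - \bar{X})^2 \ge \frac{t^2}{2}\Big\} \le \frac{2(1-\rho)}{t^2}. \]

The statement follows from observing that

\[ \#\{i : |X_i - \bar{X}| > t\} \ge k/2 \quad \Longrightarrow \quad \frac{1}{k}\sum_{i=1}^{k} (X_i - \bar{X})^2 \ge t^2/2. \]

□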
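A quick numerical sanity check of the bound $2(1-\rho)/t^2$ can be sketched in Python. The simulation below draws equicorrelated Gaussian variables through the representation $X_i = \sqrt{\rho}\,U + \sqrt{1-\rho}\,U_i$, which is one particular model satisfying the covariance condition; the function names, the seed and the parameter values are illustrative assumptions only.

```python
import math
import random


def lemma_bound(rho, t):
    """Right-hand side of the lemma, 2 (1 - rho) / t^2."""
    return 2.0 * (1.0 - rho) / (t * t)


def spread_frequency(k, rho, t, trials, seed=0):
    """Monte Carlo frequency of the event #{i : |X_i - mean| > t} >= k/2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)  # common factor U
        xs = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
              for _ in range(k)]
        mean = sum(xs) / k
        far = sum(1 for v in xs if abs(v - mean) > t)
        if 2 * far >= k:
            hits += 1
    return hits / trials
```

For instance, comparing `spread_frequency(20, 0.9, 1.0, 200)` with `lemma_bound(0.9, 1.0)` shows the empirical frequency sitting below the bound, as the Markov argument guarantees in expectation.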



The idea, therefore, is detecting unusually high concentrations of $H_i$'s,
which is a form of GOF test for the uniform distribution. Under a general
correlation model as in (1.4), with Lemma 3.1 we see that the concentration
will happen over an interval of length slightly larger than $\sqrt{1-\rho}$.
This is apparent from Lemma 1.1 under the simple correlation model (1.1).

Choose an integer $m$ such that $m \gg (n/k^2)\log(n/k^2)$ and partition the
interval $[0, 1]$ into $m$ bins of length $1/m$, denoted $I_s$,
$s = 1, \ldots, m$. Let $B_s = \#\{i : H_i \in I_s\}$ be the bin counts; thus,
we are computing a histogram. Then consider the following GOF test:

\[ f(X) = 0 \quad \text{if and only if} \quad \max_{s = 1, \ldots, m} B_s \le t, \]

where $t$ is some threshold.

Proposition 3.5. Consider the class $\mathcal{C}$ of all $k$-sets in the case
where $k^2/n \to 0$ and $k/\log n \to \infty$. In the GOF test above, choose
$m$ such that $(n/k^2)\log n \ll m \ll n/\log n$. When
$(1-\rho)^{1/2} \ll 1/m$ in (1.4), the resulting test with threshold
$t = n/m + \sqrt{3n\log(m)/m}$ has worst-case risk tending to zero.



Proof. Bernstein's inequality, applied to the binomial distribution, gives
that

\[ \mathbb{P}_0\big\{B_s > n/m + b\sqrt{n/m}\big\} \le \exp\Big[-\frac{b^2/2}{1 + (b/3)\sqrt{m/n}}\Big]. \]

This and the union bound imply that, indeed,

\[ \mathbb{P}_0\Big\{\max_{s} B_s > t\Big\} \to 0. \]

Consider now an alternative of the form (1.4), with $S$ denoting the
anomalous set. Let

\[ I := \{i \in S : |X_i - \bar{X}_S| \le 1/m\}, \qquad \bar{X}_S := \frac{1}{k}\sum_{i \in S} X_i. \]

Though the set $I$ is random, by Lemma 3.1 and the fact that
$(1-\rho)^{1/2} \ll 1/m$, we have that

\[ \mathbb{P}_S\{|I| \ge k/2\} \to 1. \]

Define the event $\Omega := \{-a \le \bar{X}_S \le a\}$ for some $a > 0$. Note
that, since the variance of $\bar{X}_S$ is bounded by 1,
$\mathbb{P}(\Omega^c) \le 2(1 - \Phi(a))$. Define $H_S = \Phi(\bar{X}_S)$.
On $\Omega$, using a simple Taylor expansion, we have

(p{a + l/m) 

where $\varphi$ denotes the standard normal density function and $a$ is taken
sufficiently large. Therefore, when $|I| \ge k/2$ and $\Omega$ hold, at least
$k/2$ of the anomalous $H_i$'s fall in an interval of length at most
$2e^{a^2}/m$. Since such an interval is covered by at most $2e^{a^2}$ bins, by
the pigeonhole principle, there is a bin that contains at least $ke^{-a^2}/4$
anomalous $H_i$'s. By Bernstein's inequality, the same bin will also contain at
least $(n-k)/m - \sqrt{3n\log(m)/m}$ nonanomalous $H_i$'s (with high
probability), so in total this bin will contain at least
$n/m - k/m - \sqrt{3n\log(m)/m} + ke^{-a^2}/4$ points. By our choice of $m$,
$k \gg \sqrt{n\log(m)/m}$, so it suffices to choose $a \to \infty$ slowly
enough that $ke^{-a^2} \gg \sqrt{n\log(m)/m}$ still. Then, with high
probability, there is a bin with more than $t$ points. □
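The histogram test itself is short to implement. The following Python sketch is one possible reading of it: each observation is mapped to $(0,1)$ by the standard normal distribution function (the probability integral transform, which is what makes the values uniform under the null), counted into $m$ equal-width bins, and the fullest bin is compared with the threshold $n/m + \sqrt{3n\log(m)/m}$. The function names are hypothetical, and the choice of $m$ is left to the caller.

```python
import math
from statistics import NormalDist


def gof_max_bin_count(x, m):
    """Largest count among m equal-width bins of Phi(X_i) on [0, 1]."""
    cdf = NormalDist().cdf
    counts = [0] * m
    for v in x:
        h = cdf(v)
        counts[min(int(h * m), m - 1)] += 1  # clamp h == 1.0 into the last bin
    return max(counts)


def gof_threshold(n, m):
    """Threshold n/m + sqrt(3 n log(m) / m)."""
    return n / m + math.sqrt(3.0 * n * math.log(m) / m)


def gof_test(x, m):
    """Return 1 (reject the null) when the fullest bin exceeds the threshold."""
    return int(gof_max_bin_count(x, m) > gof_threshold(len(x), m))
```

Unlike the GLRT, nothing in this computation depends on $\rho$ or on the anomalous set, and the cost is linear in $n$.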

Ignoring logarithmic factors, we are now able to state the following:

The GOF test is near-optimal for detecting $k$-sets in the regime
where $k^2/n \to 0$ and $k/\log n \to \infty$.

When $k/\log n \to 0$, things are somewhat different. There, the GOF test
requires that $(1-\rho)\,n^{2k/(k-1)} \to 0$, which is still close to optimal
when $k \to \infty$, but far from optimal when $k$ is bounded (e.g., when
$k = 2$, the exponent is 4 instead of 2). Indeed, when $k/\log n \to 0$, $m$
needs to be chosen larger than $n$, and Bernstein's inequality is not accurate.
Instead, we use the simple bound

\[ \mathbb{P}(\mathrm{Bin}(n, p) \ge \ell) \le 2\,\frac{(np)^{\ell}}{\ell!} \quad \text{when } np \le 1/2. \]

Note that Bennett's inequality would also do. (The analysis also requires
some refinement showing that, with probability tending to 1 under the
alternative, one cell contains at least $k$ points.) Note that in the remaining
case, $k = O(1)$, the GLRT is optimal up to a logarithmic factor, since it only
requires that $(1-\rho)\,n^2\log n \to 0$, as seen in Section 3.2.2. We do not
know whether a comparable performance can be achieved by a test that does not
have access to $\rho$.
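The simple binomial bound can be checked exactly on small cases. The Python sketch below reads the bound as $\mathbb{P}(\mathrm{Bin}(n,p) \ge \ell) \le 2(np)^{\ell}/\ell!$ for $np \le 1/2$, which follows from $\binom{n}{j}p^j \le (np)^j/j!$ and a geometric series; this reading, the function names and the parameter values are assumptions of the sketch.

```python
import math


def binomial_upper_tail(n, p, ell):
    """Exact P(Bin(n, p) >= ell)."""
    return sum(math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)
               for j in range(ell, n + 1))


def simple_tail_bound(n, p, ell):
    """The bound 2 (np)^ell / ell!, intended for np <= 1/2."""
    return 2.0 * (n * p) ** ell / math.factorial(ell)
```

With $n = 50$ and $p = 0.01$, so that $np = 1/2$, the exact tail stays below the bound for every $\ell$ tried, and the bound decays factorially in $\ell$, which is the behavior needed when most bins are expected to be empty.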

When k is unknown. In essence, we are trying to detect an interval with
a higher mean in a Poisson count setting. As before, it is enough to look
at dyadic intervals of all sizes, which can be done efficiently as explained
earlier, following the multiscale ideas in [6].

APPENDIX: PROOF OF THEOREM 3.1 

The proof is divided into three steps. The first step formalizes the fact that
we want to prove that (under $H_1$) the contaminated set has no influence
(with high probability) on the GLRT statistic. The second step exhibits
a useful high probability event. Finally, in the third step we show that on
this high probability event, the contaminated set has no influence on the
GLRT.

It can easily be seen that for every $S$ of size $k$,

\[ X^T (I - A_S^{-1}) X = \frac{\rho}{(1-\rho)(1+\rho(k-1))} \Big[\Big(\sum_{i \in S} X_i\Big)^2 - (1 + \rho(k-1)) \sum_{i \in S} X_i^2\Big]. \]

Introduce the function $g : \mathbb{R}^k \to \mathbb{R}$ defined by

\[ g(u) = \Big(\sum_{i=1}^{k} u_i\Big)^2 - (1 + \rho(k-1)) \sum_{i=1}^{k} u_i^2 \]

for $u = (u_1, \ldots, u_k) \in \mathbb{R}^k$. Denoting, for
$x \in \mathbb{R}^n$ and $S \subset \{1, \ldots, n\}$, the vector of components
of $x$ belonging to $S$ by $x|_S$, we may write the GLRT as

\[ f(x) = 0 \quad \text{if and only if} \quad \max_{S \in \mathcal{C}} g(x|_S) \le t. \]
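Note that by the symmetry of $\mathcal{C}$ and the test,

\[ R(f) = \mathbb{P}_0\Big\{\max_{S \in \mathcal{C}} g(X|_S) > t\Big\} + \frac{1}{N}\sum_{S' \in \mathcal{C}} \mathbb{P}_{S'}\Big\{\max_{S \in \mathcal{C}} g(X|_S) \le t\Big\} \]

\[ \phantom{R(f)} = \mathbb{P}_0\Big\{\max_{S \in \mathcal{C}} g(X|_S) > t\Big\} + \mathbb{P}_{\{1,\ldots,k\}}\Big\{\max_{S \in \mathcal{C}} g(X|_S) \le t\Big\}. \]

Given $X \sim \mathcal{N}(0, I)$, define the coupling $X'$ as follows:
$X_i = X'_i$ for $i \notin \{1, \ldots, k\}$, and $X_i$ and $X'_i$ are
independent for $i \in \{1, \ldots, k\}$. Note that
$X' \sim \mathcal{N}(0, A_{\{1,\ldots,k\}})$. Then, no matter what the
threshold $t$ is, we have

\[ R(f) = \mathbb{P}\Big\{\max_{S \in \mathcal{C}} g(X|_S) > t\Big\} + \mathbb{P}\Big\{\max_{S \in \mathcal{C}} g(X'|_S) \le t\Big\} \ge \mathbb{P}\Big\{\max_{S \in \mathcal{C}} g(X|_S) \ge \max_{S \in \mathcal{C}} g(X'|_S)\Big\}. \]

In the following we show that, with probability tending to 1, we have

\[ \max_{S \in \mathcal{C}} g(X|_S) \ge \max_{S \in \mathcal{C}} g(X'|_S), \]

which then implies that the GLRT is asymptotically powerless.

By Lemma 1.1, there exist $U, U_1, \ldots, U_k$ independent standard normal
such that for all $i \in \{1, \ldots, k\}$,

\[ X'_i = \sqrt{\rho}\, U + \sqrt{1-\rho}\, U_i. \]

Using the fact that $\max_{j=1,\ldots,k} |U_j| \le \sqrt{2\log k}$ with high
probability, with probability tending to 1, we have

\[ X'_1, \ldots, X'_k \in [-\zeta, \zeta], \]

where $\zeta := \sqrt{2(1-\rho)\log(\omega_k k)}$ and $\omega_k$ is any sequence
such that $\omega_k \to \infty$.

Fix $\gamma > 1$ to be determined later and define
$p = \mathbb{P}\{\zeta < U < \gamma\zeta\}$ where $U \sim \mathcal{N}(0, 1)$.
By the fact that $X_1, \ldots, X_n$ are i.i.d. standard normal,
$Z := \#\{i : \zeta < X_i < \gamma\zeta\} \sim \mathrm{Bin}(n, p)$, so that
$\mathbb{P}\{Z \ge k\} \to 1$ if $k = o(np)$. When $\gamma$ is bounded away
from 1, this is the case if $\sqrt{\log k}\, k^{2-\rho} = o(n)$.

In conclusion, we proved that the event

\[ \Omega = \big\{X'_1, \ldots, X'_k \in (-\zeta, \zeta),\ \exists\, \alpha_1, \ldots, \alpha_k, \beta_1, \ldots, \beta_k \in \{1, \ldots, n\} \text{ distinct}: X_{\alpha_1}, \ldots, X_{\alpha_k}, -X_{\beta_1}, \ldots, -X_{\beta_k} \in (\zeta, \gamma\zeta)\big\} \]

has a probability that tends to 1 if $\sqrt{\log k}\, k^{2-\rho} = o(n)$ as long
as $\gamma$ is bounded away from 1.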

We specify 7 = l/^ p + {-j^ + p)^- Note that, as required, 7 exceeds and 
is bounded away from 1. Assume that we are on the event $\Omega$. First note
that

\[ g(X_{\alpha_1}, \ldots, X_{\alpha_k}) \ge k(k-1)\zeta^2 - \rho(k-1)k\gamma^2\zeta^2 = k(k-1)\zeta^2(1 - \rho\gamma^2), \tag{A.1} \]

and the same holds for $g(X_{\beta_1}, \ldots, X_{\beta_k})$.

Let $S \in \mathcal{C}$ be such that $S \cap \{1, \ldots, k\} \ne \emptyset$.
We want to show that there exists $S'$ such that $g(X|_{S'}) \ge g(X'|_S)$.
This entails that
$\max_{S \in \mathcal{C}} g(X|_S) \ge \max_{S \in \mathcal{C}} g(X'|_S)$,
since for $S \cap \{1, \ldots, k\} = \emptyset$ we have
$g(X|_S) = g(X'|_S)$. First remark that we can assume that

(A.2) {y,x[) >ak-i)V^ 

since otherwise by (A.1) we can simply take
$S' = \{\alpha_1, \ldots, \alpha_k\}$. To simplify notation, we may assume that
$1 \in S \cap \{1, \ldots, k\}$. By definition of $\Omega$ and the fact that
$S$ contains at least one index in $\{1, \ldots, k\}$, there exist
$u, v \in \{1, \ldots, k\}$ such that $X_{\alpha_u}$ and $X_{\beta_v}$ do not
appear in $X'|_S$. We want to show that by replacing $X'_1$ by either
$X_{\alpha_u}$ or $X_{\beta_v}$ in $X'|_S$, one increases the value of $g$.
More precisely, we want to show that

\[ \max\big(g(X_{\alpha_u}, X'|_{S \setminus \{1\}}),\ g(X_{\beta_v}, X'|_{S \setminus \{1\}})\big) \ge g(X'|_S). \]

Then by induction one can show the existence of the $S'$ described above.

Note that, for $x \in \mathbb{R}^k$ and $y \in \mathbb{R}$,

\[ g(x_1, \ldots, x_{j-1}, y, x_{j+1}, \ldots, x_k) - g(x) = 2(y - x_j)\sum_{i \ne j} x_i - \rho(k-1)(y^2 - x_j^2) \]

\[ \phantom{g(x)} = (y - x_j)\Big(2\sum_{i=1}^{k} x_i - (2 + \rho(k-1))\,x_j - \rho(k-1)\,y\Big). \]
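The single-coordinate replacement identity is easy to verify numerically. The short Python sketch below takes $g(u) = (\sum_i u_i)^2 - (1+\rho(k-1))\sum_i u_i^2$ and compares the direct difference with the factored form $(y - x_j)\big(2\sum_i x_i - (2+\rho(k-1))x_j - \rho(k-1)y\big)$; the function names and the sample values are illustrative assumptions.

```python
def g(u, rho):
    """g(u) = (sum u_i)^2 - (1 + rho (k - 1)) sum u_i^2 for u of length k."""
    k = len(u)
    total = sum(u)
    return total * total - (1.0 + rho * (k - 1)) * sum(v * v for v in u)


def replacement_gain(x, j, y, rho):
    """Factored form of g(x with x[j] replaced by y) - g(x)."""
    k = len(x)
    return (y - x[j]) * (2.0 * sum(x) - (2.0 + rho * (k - 1)) * x[j]
                         - rho * (k - 1) * y)
```

For example, with `x = [1.0, 2.0, 3.0]`, `j = 1`, `y = 5.0` and `rho = 0.5`, both the direct difference and `replacement_gain` equal 3. The sign of the gain is governed by the second factor whenever $y > x_j$, which is how the identity is used in the argument.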



Consider the case where $\sum_{i \in S} X'_i \ge 0$ (the case where
$\sum_{i \in S} X'_i < 0$ is dealt with similarly). Since
$X_{\alpha_u} > X'_1$, it suffices to show that
$2\sum_{i \in S} X'_i \ge (2 + \rho(k-1))\,X'_1 + \rho(k-1)\,X_{\alpha_u}$,
which follows from



This concludes the proof.